Data Transformation

dplyr basics

The five key dplyr functions cover most data manipulation tasks:

  • Pick observations by their values (filter()).
  • Reorder the rows (arrange()).
  • Pick variables by their names (select()).
  • Create new variables with functions of existing variables (mutate()).
  • Collapse many values down to a single summary (summarise()).

These can all be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group by group. These six functions provide the verbs for a language of data manipulation.

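All verbs work the same way: the first argument is a data frame, the following arguments describe what to do with it using unquoted variable names, and the result is a new data frame. Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work.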
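As a rough illustration of chaining the verbs, here is a minimal sketch. It assumes dplyr is installed and uses a small made-up table (toy, with invented values) in place of the real flights data; the column name mean_hours is also just an illustrative choice.

```r
library(dplyr)

# Toy data standing in for a flights-like table (values are made up).
toy <- data.frame(
  carrier   = c("UA", "UA", "DL", "DL", "AA"),
  month     = c(1, 1, 1, 2, 2),
  arr_delay = c(10, 130, -5, 45, 200)
)

# Each verb takes a data frame first and returns a new data frame,
# so the steps can be chained with the pipe.
by_carrier <- toy %>%
  filter(month == 1) %>%                         # keep January rows
  select(carrier, arr_delay) %>%                 # keep only the needed columns
  mutate(delay_hours = arr_delay / 60) %>%       # add a derived column
  group_by(carrier) %>%                          # switch to per-carrier scope
  summarise(mean_hours = mean(delay_hours)) %>%  # one row per carrier
  arrange(desc(mean_hours))                      # largest mean delay first

print(by_carrier)
```

Because group_by() comes before summarise(), the mean is computed once per carrier rather than once for the whole table.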

Use filter() to filter rows

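filter() allows you to subset observations based on their values. The first argument is the data frame, and each further argument is a condition; a row is kept only if it satisfies all of them.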

filter(flights, month == 1, day == 1)

You can use comparison operators (==, !=, >, <, >=, <=) and logical operators (&, |, !, xor()) with filter(). You can also use %in% to test a variable against a set of values.
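The operators above can be tried on a small example. This is only a sketch: it assumes dplyr is installed, and toy is a made-up table with invented values, not the real flights data.

```r
library(dplyr)

# Toy data (values are made up).
toy <- data.frame(
  dest      = c("IAH", "HOU", "ORD", "IAH", "LAX"),
  month     = c(1, 1, 2, 11, 12),
  arr_delay = c(130, 20, 125, -3, 60)
)

# Comma-separated conditions are combined with "and".
with_comma <- filter(toy, month == 1, arr_delay >= 120)
with_and   <- filter(toy, month == 1 & arr_delay >= 120)

# | means "or"; %in% is a shorter way to test several values.
with_or <- filter(toy, dest == "IAH" | dest == "HOU")
with_in <- filter(toy, dest %in% c("IAH", "HOU"))

# xor() keeps rows where exactly one of the two conditions is TRUE.
exactly_one <- filter(toy, xor(month >= 11, arr_delay >= 60))
```

Here with_comma and with_and hold the same rows, as do with_or and with_in; %in% tends to read better once there are more than two values to match.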

Exercises

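# fly is assumed to be a copy of the flights data (e.g. fly <- nycflights13::flights)
#Q1
fly %>% filter(arr_delay >= 120)                  # arrival delay of two hours or more
fly %>% filter(dest == 'IAH' | dest == 'HOU')     # flew to Houston (IAH or HOU)
fly %>% filter(carrier %in% c('UA', 'DL', 'AA'))  # operated by United, Delta, or American
#Q2
fly %>% filter(between(arr_delay, 50, 100))       # between() includes both 50 and 100
# number of flights with a missing dep_time
fly %>%
  filter(is.na(dep_time)) %>%
  count()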
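To check the two points the Q2 answers rely on (between() keeping both endpoints, and missing values needing is.na()), here is a minimal sketch. It assumes dplyr is installed and uses a made-up table toy with invented values rather than the real flights data.

```r
library(dplyr)

# Toy data (values are made up); NA marks a missing value.
toy <- data.frame(
  dep_time  = c(517, NA, 542, NA, 554),
  arr_delay = c(50, NA, 75, 100, 101)
)

# between(x, left, right) is shorthand for x >= left & x <= right,
# so both endpoints are kept.
in_range  <- toy %>% filter(between(arr_delay, 50, 100))
long_form <- toy %>% filter(arr_delay >= 50, arr_delay <= 100)

# filter() keeps only rows where the condition is TRUE, so rows where it
# evaluates to NA are dropped; is.na() asks for the missing rows explicitly.
n_missing <- toy %>% filter(is.na(dep_time)) %>% nrow()
```

In this toy table in_range keeps the rows with delays of 50, 75 and 100, drops the row whose arr_delay is NA, and n_missing counts the two rows without a dep_time.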

Arrange rows with arrange()